In the framework of on-line learning, a learning machine might move around a teacher
due to the differences in structures or output functions between the teacher and the learning
machine. In this paper we analyze the generalization performance of a new student
supervised by a moving machine. A model composed of a fixed true teacher, a moving teacher,
and a student is treated theoretically using statistical mechanics, where the true teacher is a
nonmonotonic perceptron and the others are simple perceptrons. Calculating the
generalization errors numerically, we show that the generalization error of a student can
temporarily become smaller than that of a moving teacher, even if the student only uses
examples from the moving teacher. However, the generalization error of the student eventually
becomes equal to that of the moving teacher. This behavior is qualitatively different from
that of a linear model.

1. Introduction

Learning is the inference of the underlying rules that govern data generation from observed
data. The observed data are input-output pairs from a teacher and are called examples.
Learning can be roughly classified into batch learning and on-line learning [1]. In batch
learning, given examples are used more than once. In this paradigm, a student comes to give
correct answers after training if it has adequate degrees of freedom. However, batch learning
requires a long training time and a large memory in which to store many examples. In contrast,
in on-line learning each example is used once and then discarded. In this case, a student
cannot give correct answers for all the examples used in training. However, on-line learning
has some merits: for example, a large memory for storing many examples is not necessary, and
it is possible to follow a time-variant teacher.

Recently, we [6,7] have analyzed the generalization performance of ensemble learning in a
framework of on-line learning using a statistical mechanical method [1,8]. In that process, the
following points were shown as by-products: the generalization error does not approach zero
when the student is a simple perceptron and the teacher is a committee machine [12] or a
nonmonotonic perceptron [13]. Therefore, models like these can be called unlearnable cases
[9-11]. The behavior of a student in an unlearnable case depends on the learning rule. That is,
the student vector asymptotically converges in one direction under Hebbian learning. In
contrast, under perceptron learning or AdaTron learning the student vector does not converge
in one direction but continues moving. In the case of a nonmonotonic teacher, the student's
behavior can be described as continuing to go around the teacher while keeping a constant
direction cosine with the teacher.

Considering the applications of statistical learning theories, investigating the behavior of
systems in unlearnable cases is significant, since real-world problems seem to include many
unlearnable cases. In addition, a learning machine may continue going around a teacher in the
unlearnable cases, as mentioned above. Here, let us consider a new student that is supervised
by a moving learning machine. That is, we consider a student that uses the input-output pairs
of a moving teacher as training examples, and we investigate the generalization performance
of the student with respect to a true teacher. Here, the true teacher is fixed. Note that the
examples used by the student come only from the moving teacher, and the student cannot directly
observe the outputs of the true teacher. In real human society, a teacher that can be observed
by a student does not always present the correct answer; in many cases, the teacher is itself
learning and continues to vary. Therefore, analyzing such a model is interesting for
considering the analogies between statistical learning theories and real society.

A model in which a true teacher, a moving teacher, and a student are all linear perceptrons [6]
with noise has already been solved analytically [14]. It was proved that a student's
generalization error can be smaller than that of the moving teacher in the linear case, even
though the student uses only the examples of the moving teacher. However, linear perceptrons
are somewhat special as neural networks or learning machines; nonlinear perceptrons are more
common than linear ones. Therefore, in this paper we treat a model in which a true teacher, a
moving teacher, and a student are all nonlinear perceptrons. Using a statistical mechanical
method in the framework of on-line learning, we calculate the order parameters and the
generalization errors for the case in which the true teacher is a nonmonotonic perceptron and
the others are simple perceptrons. As a result, it is shown that a student's generalization
error can be smaller than that of the moving teacher. That means the student can be cleverer
than the moving teacher even though the student uses only the examples from the moving teacher.
Although this behavior is analogous to that of the linear model, in the nonlinear model the
generalization error of the student eventually becomes the same as that of the moving teacher.

2. Model

Three nonlinear perceptrons are treated in this paper: a true teacher, a moving teacher,
and a student. Their connection weights are $A$, $B$, and $J$, respectively. For simplicity, the
connection weights of the true teacher, the moving teacher, and the student are simply called
the true teacher, the moving teacher, and the student, respectively. The true teacher
$A = (A_1, \ldots, A_N)$, the moving teacher $B = (B_1, \ldots, B_N)$, the student
$J = (J_1, \ldots, J_N)$, and the input $x = (x_1, \ldots, x_N)$ are $N$-dimensional vectors.
Each component $A_i$ of $A$ is drawn from $\mathcal{N}(0, 1)$ independently and fixed, where
$\mathcal{N}(0, 1)$ denotes the Gaussian distribution with a mean of zero and a variance of unity.
Each of the components $B_i^0$, $J_i^0$ of the initial values $B^0$, $J^0$ of $B$, $J$ is drawn
from $\mathcal{N}(0, 1)$ independently. Each component $x_i$ of $x$ is drawn from
$\mathcal{N}(0, 1/N)$ independently. Thus,

$$\langle A_i \rangle = 0, \quad \langle (A_i)^2 \rangle = 1, \tag{1}$$

$$\langle B_i^0 \rangle = 0, \quad \langle (B_i^0)^2 \rangle = 1, \tag{2}$$

$$\langle J_i^0 \rangle = 0, \quad \langle (J_i^0)^2 \rangle = 1, \tag{3}$$

$$\langle x_i \rangle = 0, \quad \langle (x_i)^2 \rangle = \frac{1}{N}, \tag{4}$$

where $\langle \cdot \rangle$ denotes a mean.

In this paper, the thermodynamic limit $N \to \infty$ is also treated. Therefore,

$$\|A\| = \sqrt{N}, \quad \|B^0\| = \sqrt{N}, \quad \|J^0\| = \sqrt{N}, \quad \|x\| = 1, \tag{5}$$

where $\| \cdot \|$ denotes a vector norm. Generally, the norms $\|B\|$ and $\|J\|$ of the moving
teacher and the student change as the time step proceeds. Therefore, the ratios $l_B$ and $l_J$
of the norms to $\sqrt{N}$ are introduced and are called the length of the moving teacher and the
length of the student. That is, $\|B\| = l_B \sqrt{N}$ and $\|J\| = l_J \sqrt{N}$.

The internal potentials $y$ of the true teacher, $v l_B$ of the moving teacher, and $u l_J$ of the
student are

$$y = A \cdot x, \tag{6}$$

$$v l_B = B \cdot x, \tag{7}$$

$$u l_J = J \cdot x, \tag{8}$$

where $y$, $v$, and $u$ obey Gaussian distributions with means of zero and variances of unity.
The output of the true teacher, which has a nonmonotonic output function, is

$$d = \mathrm{sgn}\left( (y - a)\, y\, (y + a) \right), \tag{9}$$

where $a$ is a fixed threshold of the nonmonotonic function. The outputs of the moving teacher
and the student, which are simple perceptrons, are $\mathrm{sgn}(v l_B)$ and $\mathrm{sgn}(u l_J)$,
respectively. Here, $\mathrm{sgn}(\cdot)$ is a sign function defined as

$$\mathrm{sgn}(z) = \begin{cases} +1, & z \geq 0, \\ -1, & z < 0. \end{cases} \tag{10}$$
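As an illustration of the definitions above, the following is a minimal Python sketch that draws the three weight vectors and one input and evaluates the three outputs. The function names, the seed, and the size N = 1000 are assumptions made only for this example; they are not taken from the paper, whose simulations use a much larger N.

```python
import math
import random

def sgn(z):
    """Sign function of Eq. (10): +1 for z >= 0, -1 for z < 0."""
    return 1 if z >= 0 else -1

def dot(p, q):
    """Inner product of two equal-length lists."""
    return sum(pi * qi for pi, qi in zip(p, q))

def sample_weights(n, rng):
    """Draw an n-dimensional weight vector with N(0, 1) components."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def sample_input(n, rng):
    """Draw an input with N(0, 1/n) components, so that ||x|| is close to 1."""
    sigma = 1.0 / math.sqrt(n)
    return [rng.gauss(0.0, sigma) for _ in range(n)]

def true_teacher_output(y, a):
    """Nonmonotonic output of Eq. (9)."""
    return sgn((y - a) * y * (y + a))

rng = random.Random(0)      # fixed seed, chosen only for reproducibility
N, a = 1000, 0.5            # illustrative values (assumption of this sketch)
A = sample_weights(N, rng)  # true teacher
B = sample_weights(N, rng)  # initial moving teacher B^0
J = sample_weights(N, rng)  # initial student J^0
x = sample_input(N, rng)

y = dot(A, x)                  # internal potential of the true teacher, Eq. (6)
d = true_teacher_output(y, a)  # output of the true teacher
out_B = sgn(dot(B, x))         # sgn(v l_B) = sgn(B . x), since l_B > 0
out_J = sgn(dot(J, x))         # sgn(u l_J) = sgn(J . x), since l_J > 0
print(d, out_B, out_J)
```

Because the lengths $l_B$ and $l_J$ are positive, the outputs of the simple perceptrons depend only on the signs of $B \cdot x$ and $J \cdot x$, which is why the sketch never needs to compute the lengths explicitly.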

In the model treated in this paper, the moving teacher $B$ is updated using an input $x$ and
the output of the true teacher $A$ for the input $x$. The student $J$ is updated using an input
$x$ and the output of the moving teacher $B$ for the input $x$. The moving teacher is considered
to use perceptron learning. That is,

$$B^{m+1} = B^m + \eta_B\, \Theta(-v^m d^m)\, d^m x^m \tag{11}$$

$$= B^m + \eta_B\, \Theta\left( -v^m (y^m - a) y^m (y^m + a) \right) \mathrm{sgn}\left( (y^m - a) y^m (y^m + a) \right) x^m, \tag{12}$$

where $\eta_B$ denotes the learning rate of the moving teacher and is a constant. Furthermore,
$m$ denotes the time step, and $\Theta(\cdot)$ denotes the step function defined as

$$\Theta(z) = \begin{cases} +1, & z \geq 0, \\ 0, & z < 0. \end{cases} \tag{13}$$

The student is also considered to use perceptron learning. That is,

$$J^{m+1} = J^m + \eta_J\, \Theta(-u^m v^m)\, \mathrm{sgn}(v^m)\, x^m, \tag{14}$$

where $\eta_J$ denotes the student's learning rate and is a constant. Generalizing the
learning rules, Eqs. (12) and (14) can be expressed as

$$B^{m+1} = B^m + g^m x^m, \tag{15}$$

$$J^{m+1} = J^m + f^m x^m, \tag{16}$$

respectively. Here, $g$ and $f$ are the update functions of the moving teacher and the student,
respectively.
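The two perceptron-learning updates can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the helper names (`online_step`, `direction_cosine`), the seed, the size N = 100, and the student learning rate 0.2 are choices made for the example. The student's update is written as perceptron learning on the moving teacher's output, following the description in the text, and a mismatch of output signs is used in place of the step function (the two differ only when an internal potential is exactly zero).

```python
import math
import random

def sgn(z):
    return 1 if z >= 0 else -1

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def direction_cosine(p, q):
    """Cosine of the angle between two vectors."""
    return dot(p, q) / math.sqrt(dot(p, p) * dot(q, q))

def online_step(A, B, J, x, a, eta_B, eta_J):
    """One on-line step: B learns from the true teacher A, J learns from B.

    B and J are updated in place; the update amplitudes g and f are returned.
    """
    y = dot(A, x)
    d = sgn((y - a) * y * (y + a))   # true teacher output
    out_B = sgn(dot(B, x))           # moving teacher output before its update
    out_J = sgn(dot(J, x))           # student output
    g = eta_B * d if out_B != d else 0.0          # update only on a mistake
    f = eta_J * out_B if out_J != out_B else 0.0  # student imitates B's answer
    for i, xi in enumerate(x):
        B[i] += g * xi
        J[i] += f * xi
    return g, f

rng = random.Random(0)                 # seed chosen only for reproducibility
N, a, eta_B, eta_J = 100, 0.5, 0.1, 0.2  # illustrative values (assumptions)
A = [rng.gauss(0.0, 1.0) for _ in range(N)]
B = [rng.gauss(0.0, 1.0) for _ in range(N)]
J = [rng.gauss(0.0, 1.0) for _ in range(N)]
for m in range(5 * N):                 # run up to normalized time m/N = 5
    x = [rng.gauss(0.0, 1.0 / math.sqrt(N)) for _ in range(N)]
    online_step(A, B, J, x, a, eta_B, eta_J)
print(direction_cosine(A, B), direction_cosine(A, J), direction_cosine(B, J))
```

Note that the student's update uses the moving teacher's output computed before the moving teacher itself is updated in the same step, so both updates at step m refer to the same input.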

3. Theory 

3.1 Generalization Error 

A goal of statistical learning theory is to theoretically obtain generalization errors. We
use

$$\epsilon_B^m = \Theta\left( -d^m\, \mathrm{sgn}(v^m l_B^m) \right) \tag{17}$$

$$= \Theta\left( -(y^m - a) y^m (y^m + a) v^m \right) \tag{18}$$

and

$$\epsilon_J^m = \Theta\left( -d^m\, \mathrm{sgn}(u^m l_J^m) \right) \tag{19}$$

$$= \Theta\left( -(y^m - a) y^m (y^m + a) u^m \right) \tag{20}$$

as errors of the moving teacher and student, respectively. The superscripts $m$, which represent
the time steps, are omitted below for simplicity. We define a generalization error as the mean
of the error over the distribution $p(x)$ of inputs $x$. The error $\epsilon_B$ of the moving
teacher and the error $\epsilon_J$ of the student can be expressed as $\epsilon_B(y, v)$ and
$\epsilon_J(y, u)$ using $y$, $v$, and $u$. Therefore, the generalization error $\epsilon_{gB}$
of the moving teacher and the generalization error $\epsilon_{gJ}$ of the student can be
calculated using the distributions $p(y, v)$ and $p(y, u)$ as follows:

$$\epsilon_{gB} = \langle \epsilon_B \rangle_x \tag{21}$$

$$= \int dx\, p(x)\, \epsilon_B \tag{22}$$

$$= \int dy\, dv\, p(y, v)\, \epsilon_B(y, v), \tag{23}$$

$$\epsilon_{gJ} = \langle \epsilon_J \rangle_x \tag{24}$$

$$= \int dx\, p(x)\, \epsilon_J \tag{25}$$

$$= \int dy\, du\, p(y, u)\, \epsilon_J(y, u). \tag{26}$$

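Since $y$, $v$, and $u$ are calculated using $A$, $B$, $J$, and the independent input $x$,
$p(y, v, u)$ is the multivariate Gaussian distribution with means of zero and the covariance
matrix

$$\Sigma = \begin{pmatrix} 1 & R_B & R_J \\ R_B & 1 & R_{BJ} \\ R_J & R_{BJ} & 1 \end{pmatrix}. \tag{27}$$

Here, $R_B$ is the direction cosine between $A$ and $B$, $R_J$ is the direction cosine between
$A$ and $J$, and $R_{BJ}$ is the direction cosine between $B$ and $J$. Thus,

$$R_B = \frac{A \cdot B}{\|A\| \|B\|}, \tag{28}$$

$$R_J = \frac{A \cdot J}{\|A\| \|J\|}, \tag{29}$$

$$R_{BJ} = \frac{B \cdot J}{\|B\| \|J\|}. \tag{30}$$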
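One way to make the role of the covariance matrix of Eq. (27) concrete is to draw the triple $(y, v, u)$ directly from three independent standard Gaussians. The sketch below does this with a hand-written triangular (Cholesky-type) construction; the function names and the numerical values of the direction cosines are assumptions chosen only for illustration, and the construction requires $|R_B| < 1$.

```python
import math
import random

def sample_yvu(R_B, R_J, R_BJ, rng):
    """Draw one (y, v, u) with unit variances and the correlations of Eq. (27).

    Assumes |R_B| < 1 and that the three direction cosines define a valid
    (positive semidefinite) covariance matrix.
    """
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    z3 = rng.gauss(0.0, 1.0)
    s_B = math.sqrt(1.0 - R_B ** 2)
    c = (R_BJ - R_B * R_J) / s_B      # coefficient fixing Cov(v, u) = R_BJ
    s2 = 1.0 - R_J ** 2 - c ** 2      # remaining variance of u
    if s2 < 0.0:
        raise ValueError("direction cosines are not mutually consistent")
    y = z1
    v = R_B * z1 + s_B * z2
    u = R_J * z1 + c * z2 + math.sqrt(s2) * z3
    return y, v, u

def empirical_correlations(R_B, R_J, R_BJ, n_samples, rng):
    """Sample averages of y*v, y*u and v*u over n_samples draws."""
    s_yv = s_yu = s_vu = 0.0
    for _ in range(n_samples):
        y, v, u = sample_yvu(R_B, R_J, R_BJ, rng)
        s_yv += y * v
        s_yu += y * u
        s_vu += v * u
    return s_yv / n_samples, s_yu / n_samples, s_vu / n_samples

# Illustrative direction cosines (assumptions); the averages should be close to them.
print(empirical_correlations(0.6, 0.5, 0.7, 20000, random.Random(0)))
```

Sampling the three internal potentials in this way avoids handling $N$-dimensional vectors when only averages over $p(y, v, u)$ are needed.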

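Equations (23) and (26) can be calculated by executing the Gaussian integrations using these
direction cosines as follows [9-11]:

$$\epsilon_{gB} = 2 \int_0^a Dy\, H\left( \frac{-R_B\, y}{\sqrt{1 - R_B^2}} \right) + 2 \int_a^\infty Dy\, H\left( \frac{R_B\, y}{\sqrt{1 - R_B^2}} \right), \tag{31}$$

$$\epsilon_{gJ} = 2 \int_0^a Dy\, H\left( \frac{-R_J\, y}{\sqrt{1 - R_J^2}} \right) + 2 \int_a^\infty Dy\, H\left( \frac{R_J\, y}{\sqrt{1 - R_J^2}} \right), \tag{32}$$

where

$$H(u) \equiv \int_u^\infty Dy, \qquad Dy \equiv \frac{dy}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right). \tag{33}$$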
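The Gaussian integrals for the generalization error are straightforward to evaluate numerically. The following sketch writes the error of a simple perceptron with direction cosine $R$ to the nonmonotonic teacher as $2\int_0^a Dy\, H(-ky) + 2\int_a^\infty Dy\, H(ky)$ with $k = R/\sqrt{1-R^2}$, and integrates it with a composite Simpson rule. The function names, the cutoff `y_max` that replaces the infinite upper limit, and the number of subintervals are assumptions of this sketch.

```python
import math

def H(u):
    """Gaussian tail probability: integral of Dy from u to infinity."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def gauss_pdf(y):
    """Standard Gaussian density, i.e. the weight in the measure Dy."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def simpson(func, lo, hi, n=2000):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = func(lo) + func(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(lo + k * h)
    return total * h / 3.0

def gen_error(R, a, y_max=10.0):
    """Generalization error for direction cosine R (|R| < 1) and threshold a.

    The infinite upper limit is truncated at y_max (assumption of this sketch).
    """
    k = R / math.sqrt(1.0 - R * R)
    inner = simpson(lambda y: gauss_pdf(y) * H(-k * y), 0.0, a)
    outer = simpson(lambda y: gauss_pdf(y) * H(k * y), a, y_max)
    return 2.0 * (inner + outer)

for R in (0.0, 0.5, 0.8, 0.905, 0.99):
    print(R, gen_error(R, 0.5))
```

For $R = 0$ the result is $1/2$, as expected for a student uncorrelated with the teacher; for $a = 0.5$ the printed values decrease with $R$ up to roughly $R \approx 0.9$ and then increase again.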



The relationship among the true teacher A, the moving teacher B, and the student J is shown 
in Fig. 1. 



Fig. 1. True teacher $A$, moving teacher $B$, and student $J$. $R_B$, $R_J$, and $R_{BJ}$ are direction cosines.



3.2 Differential equations of order parameters

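Since we treat the thermodynamic limit $N \to \infty$ in this paper, $O(N)$ updates of Eqs.
(12) and (14) are necessary for the order parameters to change by $O(1)$. Therefore, we denote
the time step $m$ normalized by the dimension $N$ as a continuous time $t = m/N$. We use $t$ as a
subscript for the learning process.

The generalization errors $\epsilon_{gB}$ and $\epsilon_{gJ}$ can be calculated if all the order
parameters $R_B$, $R_J$, and $R_{BJ}$ are known. Therefore, simultaneous differential equations
in deterministic forms [8] have been obtained that describe the dynamical behaviors of the order
parameters based on self-averaging in the thermodynamic limit, as follows [14]:

$$\frac{dl_B}{dt} = \langle g v \rangle + \frac{\langle g^2 \rangle}{2 l_B}, \tag{34}$$

$$\frac{dl_J}{dt} = \langle f u \rangle + \frac{\langle f^2 \rangle}{2 l_J}, \tag{35}$$

$$\frac{dR_{BJ}}{dt} = -R_{BJ} \left( \frac{1}{l_J} \frac{dl_J}{dt} + \frac{1}{l_B} \frac{dl_B}{dt} \right) + \frac{1}{l_B} \langle g u \rangle + \frac{1}{l_J} \langle f v \rangle + \frac{1}{l_B l_J} \langle g f \rangle, \tag{36}$$

$$\frac{dR_J}{dt} = \frac{1}{l_J} \left( -R_J \frac{dl_J}{dt} + \langle f y \rangle \right), \tag{37}$$

$$\frac{dR_B}{dt} = \frac{\langle g y \rangle - \langle g v \rangle R_B}{l_B} - \frac{R_B}{2 l_B^2} \langle g^2 \rangle. \tag{38}$$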

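As mentioned above, $y$, $v$, and $u$ obey the triple Gaussian distribution with means of zero
and the covariance matrix of Eq. (27). Using this, we can calculate the nine sample averages
that appear in Eqs. (34)-(38) as follows:

$$\langle g v \rangle = \frac{\eta_B}{\sqrt{2\pi}} \left( R_B \left( 2 \exp\left( -\frac{a^2}{2} \right) - 1 \right) - 1 \right), \tag{39}$$

$$\langle g^2 \rangle = \eta_B^2 \left[ 2 \int_0^a Dy\, H\left( \frac{-R_B\, y}{\sqrt{1 - R_B^2}} \right) + 2 \int_a^\infty Dy\, H\left( \frac{R_B\, y}{\sqrt{1 - R_B^2}} \right) \right], \tag{40}$$

$$\langle f u \rangle = \eta_J \frac{R_{BJ} - 1}{\sqrt{2\pi}}, \tag{41}$$

$$\langle f^2 \rangle = \frac{\eta_J^2}{\pi} \tan^{-1} \frac{\sqrt{1 - R_{BJ}^2}}{R_{BJ}}, \tag{42}$$

$$\langle g u \rangle = \frac{\eta_B}{\sqrt{2\pi}} \left( R_J \left( 2 \exp\left( -\frac{a^2}{2} \right) - 1 \right) - R_{BJ} \right), \tag{43}$$

$$\langle f v \rangle = \eta_J \frac{1 - R_{BJ}}{\sqrt{2\pi}}, \tag{44}$$

$$\langle g f \rangle = -2 \eta_B \eta_J \left( \int_0^a Dy + \int_{-\infty}^{-a} Dy \right) \int_{-\infty}^{\frac{R_B y}{\sqrt{1 - R_B^2}}} Dv\, H\left( \frac{y R_J \sqrt{1 - R_B^2} + v (R_B R_J - R_{BJ})}{\sqrt{1 + 2 R_B R_J R_{BJ} - R_B^2 - R_J^2 - R_{BJ}^2}} \right), \tag{45}$$

$$\langle f y \rangle = \eta_J \frac{R_B - R_J}{\sqrt{2\pi}}, \tag{46}$$

$$\langle g y \rangle = \frac{\eta_B}{\sqrt{2\pi}} \left( 2 \exp\left( -\frac{a^2}{2} \right) - 1 - R_B \right). \tag{47}$$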
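Because the moving teacher learns only from the true teacher, its two order parameters $l_B$ and $R_B$ form a closed subsystem that can be integrated without the student. The sketch below does so with a plain Euler scheme, using $dl_B/dt = \langle gv \rangle + \langle g^2 \rangle/(2 l_B)$ and $dR_B/dt = (\langle gy \rangle - \langle gv \rangle R_B)/l_B - R_B \langle g^2 \rangle/(2 l_B^2)$, with $\langle g^2 \rangle = \eta_B^2 \epsilon_{gB}$ evaluated by a midpoint rule. The function names, the step size, the integration cutoff, and the initial condition $l_B = 1$, $R_B = 0$ are assumptions of this sketch, not the numerical procedure used in the paper.

```python
import math

def H(u):
    """Gaussian tail probability."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))

def gen_error(R, a, n=400, y_max=8.0):
    """Generalization error for direction cosine R (|R| < 1), midpoint rule."""
    k = R / math.sqrt(1.0 - R * R)

    def integral(func, lo, hi):
        h = (hi - lo) / n
        return h * sum(func(lo + (j + 0.5) * h) for j in range(n))

    def pdf(y):
        return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

    return 2.0 * (integral(lambda y: pdf(y) * H(-k * y), 0.0, a)
                  + integral(lambda y: pdf(y) * H(k * y), a, y_max))

def moving_teacher_rhs(l_B, R_B, a, eta_B):
    """Right-hand sides of the equations for l_B and R_B."""
    c = 2.0 * math.exp(-0.5 * a * a) - 1.0
    gv = eta_B / math.sqrt(2.0 * math.pi) * (R_B * c - 1.0)  # <gv>
    g2 = eta_B ** 2 * gen_error(R_B, a)                      # <g^2>
    gy = eta_B / math.sqrt(2.0 * math.pi) * (c - R_B)        # <gy>
    dl = gv + g2 / (2.0 * l_B)
    dR = (gy - gv * R_B) / l_B - R_B * g2 / (2.0 * l_B ** 2)
    return dl, dR

def integrate_moving_teacher(a=0.5, eta_B=0.1, t_end=5.0, dt=0.01):
    """Euler integration from l_B = 1, R_B = 0; returns a list of (t, l_B, R_B)."""
    l_B, R_B = 1.0, 0.0
    trajectory = [(0.0, l_B, R_B)]
    for step in range(1, int(round(t_end / dt)) + 1):
        dl, dR = moving_teacher_rhs(l_B, R_B, a, eta_B)
        l_B += dt * dl
        R_B += dt * dR
        trajectory.append((step * dt, l_B, R_B))
    return trajectory

print(integrate_moving_teacher()[-1])
```

The full system for the student additionally needs $R_J$, $R_{BJ}$, and $l_J$, whose equations involve the double integral for $\langle gf \rangle$; that part is omitted here to keep the sketch short.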

4. Results and discussion 

Figures 2-5 illustrate the dynamical behaviors of the generalization errors and the order
parameters. The threshold $a$ of the true teacher is 0.5 and the learning rate $\eta_B$ of the
moving teacher is 0.1. In these figures, the curves represent the theoretical results and the
symbols represent the simulation results, where $N = 10^4$. In the theoretical calculations, the
simultaneous differential equations (34)-(38) have been solved numerically, with the sample
averages appearing in them also obtained numerically. The generalization errors $\epsilon_{gB}$
and $\epsilon_{gJ}$ are calculated by executing the integrations in Eqs. (31) and (32) numerically
using the obtained $R_B$, $R_J$, and $R_{BJ}$. In the computer simulations, the generalization
errors have been measured through tests using $10^5$ random inputs at each time step. In these
figures, the theoretical results and the computer simulations closely agree with each other.
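Measuring a generalization error by testing on random inputs, as described above, can be sketched as follows. The function name `measure_error`, the seed, and the small sizes used here are assumptions made to keep a pure-Python example fast; the simulations in the text use $N = 10^4$ and $10^5$ test inputs.

```python
import math
import random

def sgn(z):
    return 1 if z >= 0 else -1

def measure_error(A, W, a, n_tests, rng):
    """Fraction of random test inputs on which sgn(W . x) differs from the
    nonmonotonic true teacher with weights A and threshold a."""
    n = len(A)
    sigma = 1.0 / math.sqrt(n)        # input components are N(0, 1/n)
    mistakes = 0
    for _ in range(n_tests):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        y = sum(ai * xi for ai, xi in zip(A, x))
        d = sgn((y - a) * y * (y + a))
        if sgn(sum(wi * xi for wi, xi in zip(W, x))) != d:
            mistakes += 1
    return mistakes / n_tests

rng = random.Random(0)                # seed chosen only for reproducibility
N = 50                                # illustrative size (assumption)
A = [rng.gauss(0.0, 1.0) for _ in range(N)]
W = [rng.gauss(0.0, 1.0) for _ in range(N)]
print(measure_error(A, W, 0.5, 2000, rng))
```

As a sanity check, a machine whose weights coincide with $A$ errs exactly when $0 < |y| < a$, so for $\|A\| = \sqrt{N}$ and $a = 0.5$ its measured error should be close to $\mathrm{erf}(a/\sqrt{2}) \approx 0.383$ rather than zero, which reflects the unlearnable nature of the nonmonotonic teacher.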

Figure 2 shows that the student's generalization error $\epsilon_{gJ}$ is always larger than
$\epsilon_{gB}$ of the moving teacher when the student's learning rate $\eta_J$ is relatively
large, for example $\eta_J = 1.0$. In that case, $\epsilon_{gJ}$ approaches $\epsilon_{gB}$
asymptotically. On the other hand, $\epsilon_{gJ}$ temporarily becomes smaller than
$\epsilon_{gB}$ when the learning rate $\eta_J$ is relatively small, for example
$\eta_J = 0.2$, $0.05$, or $0.01$. This is an interesting phenomenon since the student can
temporarily become cleverer than the moving teacher even though the student uses only the
examples from the moving teacher. This is the same as the linear case [14], in which
$\epsilon_{gJ}$ can become smaller than $\epsilon_{gB}$. In the linear case [14], a small
$\epsilon_{gJ}$ is maintained after $\epsilon_{gJ}$ becomes smaller than $\epsilon_{gB}$.
However, $\epsilon_{gJ}$ returns to the same value as $\epsilon_{gB}$ in the nonlinear case
treated in this paper. This behavior is interesting since it is qualitatively different from the
linear case. In addition, the overshoot of $\epsilon_{gJ}$ occurs only once when $\eta_J = 0.2$.
On the other hand, $\epsilon_{gJ}$ swings three times when $\eta_J = 0.05$ or $0.01$.

Figure 3 shows that $R_J$ temporarily becomes larger than $R_B$ when $\eta_J$ is small. This means
that $J$ comes closer to $A$ than $B$ does. Although the overshoot of $R_J$ occurs only once,
$\epsilon_{gJ}$ swings three times when $\eta_J = 0.05$ or $0.01$. The reason for this difference
can be understood as follows. In the case of a nonmonotonic teacher, the relationship between the
generalization error $\epsilon_g$ and the direction cosine $R$ is given by Eq. (31) or (32) [9-11].
In the case of $a < \sqrt{2 \ln 2} \simeq 1.18$, $\epsilon_g$ is not a monotonic function of $R$
and takes a minimum value when $R = \sqrt{(2 \ln 2 - a^2)/(2 \ln 2)}$. Since $a = 0.5$ is treated
in this section, $\epsilon_g$ takes a minimum value when $R \simeq 0.905$. The theoretical curves
of $\eta_J = 0.05$ and $0.01$ in Fig. 3 indicate that $R_J$ passes through 0.905 twice. This
corresponds to the two local minima of $\epsilon_{gJ}$ in Fig. 2. On the other hand, $R_J$ does
not reach 0.905 when $\eta_J = 0.2$. Therefore, $\epsilon_{gJ}$ has only one minimum in that case.
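The location of the minimum can be checked with a short calculation. Writing $k = R/\sqrt{1-R^2}$ and differentiating $\epsilon_g = 2\int_0^a Dy\, H(-ky) + 2\int_a^\infty Dy\, H(ky)$ with respect to $k$ gives a derivative proportional to $1 - 2\exp(-a^2(1+k^2)/2)$, which vanishes when $1/(1-R^2) = 2\ln 2/a^2$. The small sketch below evaluates the resulting expressions; the function names are assumptions of this example.

```python
import math

LN2 = math.log(2.0)

def critical_threshold():
    """Largest threshold a for which eps_g(R) has an interior minimum."""
    return math.sqrt(2.0 * LN2)

def optimal_direction_cosine(a):
    """Direction cosine R that minimizes eps_g, or None if |a| >= sqrt(2 ln 2)."""
    if abs(a) >= critical_threshold():
        return None
    return math.sqrt((2.0 * LN2 - a * a) / (2.0 * LN2))

print(round(critical_threshold(), 2))           # 1.18
print(round(optimal_direction_cosine(0.5), 3))  # 0.905
```

For $a = 0.5$ this reproduces the value $R \simeq 0.905$ quoted in the text, and for $a \to 0$ the minimum moves to $R = 1$, as expected for a monotonic teacher.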

Figure 3 shows that the maximum value of $R_J$ is unity when $\eta_J = 0.01$. This is also a very
interesting phenomenon since the direction cosine between a teacher and a student does not
reach unity when the student learns from the nonmonotonic teacher using perceptron learning [11].

In addition, $R_B$ and $R_J$ agree with each other after sufficiently many time steps. However,
the moving teacher and the student do not coincide with each other. That is, $R_{BJ}$ is smaller
than unity, as shown in Fig. 4.

5. Conclusion 

In the framework of on-line learning, a learning machine might move around a teacher due
to the differences in structures or output functions between the teacher and the learning
machine. In this paper we analyzed the generalization performance of a new student supervised
by a moving machine. A model composed of a fixed true teacher, a moving teacher, and a
student was treated theoretically using statistical mechanics, where the true teacher is a
nonmonotonic perceptron and the others are simple perceptrons. Calculating the generalization
errors numerically, we have shown that a student's generalization error can temporarily
become smaller than that of a moving teacher, even if the student only uses examples from
the moving teacher. However, the student's generalization error eventually becomes the same
as that of the moving teacher. This behavior is qualitatively different from that of a
linear model.

Fig. 2. Dynamical behaviors of $\epsilon_{gB}$ and $\epsilon_{gJ}$ (horizontal axis: $t = m/N$, from 0.01 to 1000). Conditions are $a = 0.5$ and $\eta_B = 0.1$. Curves represent theoretical results and symbols represent simulation results, where $N = 10^4$.

Fig. 3. Dynamical behaviors of $R_B$ and $R_J$ (horizontal axis: $t = m/N$, from 0.01 to 1000). Conditions are $a = 0.5$ and $\eta_B = 0.1$. Curves represent theoretical results and symbols represent simulation results, where $N = 10^4$.

Fig. 4. Dynamical behaviors of $R_{BJ}$. Conditions are $a = 0.5$ and $\eta_B = 0.1$. Curves represent theoretical results and symbols represent simulation results, where $N = 10^4$.

Fig. 5. Dynamical behaviors of $l_B$ and $l_J$. Conditions are $a = 0.5$ and $\eta_B = 0.1$. Curves represent theoretical results and symbols represent simulation results, where $N = 10^4$.